(57) ABSTRACT

Data to be analyzed is transferred from one or more user 
systems to a host system, which includes an analysis/ 
decision support module. Queries are generated, either auto- 
matically by the analysis/decision support module, or by the 
user, who then submits them to the host system. More than 
one user may participate in the system, including transfer- 
ring data to the host. This joint participation includes the 
option of collaboratively submitting or adjusting queries and 
viewing the results of the data analysis, either in real time, 
or asynchronously. Data used as the basis of an analysis may 
therefore come from different entities, even from data bases 
that are available publicly via the network, but whose 
owners are not participants in the collaborative, hosted 
analysis system according to the invention. The host system 
thus acts as a network portal through which different users 
may store and share not only data for analysis, but also the 
results of such analysis. 

6 Claims, 7 Drawing Sheets

SYSTEM AND METHOD FOR
COLLABORATIVE HOSTED ANALYSIS OF 
DATA BASES VIA A NETWORK PORTAL 

CROSS-REFERENCE TO RELATED
APPLICATIONS 

This is a Continuation-in-Part of and claims priority from
pending U.S. patent application Ser. No. 09/479,194, filed
Jan. 7, 2000, which is a Continuation-in-Part of U.S. patent
application Ser. No. 08/850,828, filed on May 2, 1997, now
U.S. Pat. No. 6,014,661, which claims benefit of Provisional
Application No. 60/019,049, filed May 6, 1996.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

This invention relates to a method and system for access- 
ing and automatically analyzing data in one or more data 
bases and for allowing at least one user to selectively view 
the results of the data analysis based on interactive queries.

2. Description of the Related Art 

At present, when a user wishes to analyze the data in a 
data base, he faces the tedious task of entering a series of 
search parameters via a screen of input parameters. At times,
the various queries must be linked using Boolean operators, 
and changing one parameter or operator may often neces- 
sitate changing many other less crucial parameters so as to 
keep them within the logical range of the input data set. 
Similar difficulties are now also arising when a user or a
search engine scans many Internet sites to match certain 
criteria. 

Furthermore, the concept of "analyzing" the data in a data 
base usually entails determining and examining the strength 
of relationships between one or more independent data
characteristics and the remaining characteristics. This, in
turn, leads to an additional difficulty: one must decide what
is meant by the "strength" of a relationship and how to go about
measuring this strength. Often, however, the user does not or
cannot know in advance what the best measure is.

One common measure of relational strength is statistical 
correlation as determined using linear regression techniques. 
This relieves the user of the responsibility for deciding on a 
measure, but it also restricts the usefulness of the analysis to 
data that happens to fit the assumptions inherent in the linear
regression technique itself. The relational information pro- 
vided by linear regression is, for example, often worse than 
useless for a bi-modal distribution (for example, with many 
data points at the "high" and "low" ends of a scale, but with 
few in the "middle") since any relationship indicated will
not be valid and may mislead the user. 

Another problem with existing data base analysis systems 
is that they are in general centralized, meaning that the data 
bases, the query and analysis engine, and the display system 
are all contained within the same general system, at the same
site. This means that a user with a large data set but no 
powerful analysis engine must first find and install the 
engine before being able to study the data set. Along with 
such a standard solution to the problem comes the need to 
maintain the software. This solution is particularly
inefficient when there is no on-going need to analyze the stored
data. Moreover, if the user wants to analyze data in a data 
base not at his own site, but rather in a remote, possibly 
publicly available data base, then he would either have to 
hope that the remote site has proper data analysis software,
or else he would have to acquire the data set and study it at 
a site that has the proper software analysis tools. This would
be unwieldy at best and possibly impossible if the remote
data base is very far away, or is distributed among different 
sites, or has a data set so large that importation into the 
user's own analysis system is impractical. 

Yet another problem arises where two or more users
wish to be able to share not only data, but also the ability to 
analyze it, and then perhaps even share the results with still 
other entities. If only one entity has the ability to analyze the 
data, then it will be difficult or impossible to allow others to 
help direct or otherwise participate in the analysis or its 
results. This makes it hard for different users in a single 
company to most efficiently develop and share results of 
analysis of data, especially when the users are at different 
physical sites. For example, researchers working in a large 
pharmaceutical corporation, as well as data they collect, are 
often located at facilities far away from each other. 

What is needed is a system that can take an input data set, 
select suitable (but user-changeable), software-generated 
query devices, and display the data in a way that allows the 
user to easily see and interactively explore potential rela- 
tionships within the data set. The query system should also 
be dynamic such that it allows a user to select a parameter 
or data characteristic of interest and then automatically 
determines the relationship of the selected parameter with 
the remaining parameters. Moreover, the system should 
automatically adjust the display so that the data is presented
logically consistently. 

The system should preferably make it possible for a user 
either to analyze remote data sets, or to analyze local data 
sets without needing to acquire and install specialized analy- 
sis software, or both. It should preferably still be possible to 
analyze local data bases even though they may be installed 
behind a so-called "firewall." 

It should also be not only possible but easy for users even 
at different locations to be able to access each other's data, 
and preferably to incorporate even other data into their 
analysis. Ideally, the participants in the analysis system 
should not have to be within the same organization; rather, 
it should be possible for people to collaborate in and share 
the results of data analysis even in the context of an 
extended/virtual enterprise, in which the participants may be 
spread across multiple organizations, and across multiple 
sites. As just one example, the system should easily accom- 
modate a research project involving a collaboration of 
research efforts by a pharmaceutical company, a biotechnology
company, and a university research institution. It should
be possible to readily share not only data, but even the 
results of the analysis of the data, such as visualizations, 
reports, computations, etc., preferably even with e-mail 
notification. This invention makes this possible. 

SUMMARY OF THE INVENTION 

The invention provides a method and a related system for 
processing data from at least one data base. The main steps 
of the method according to the invention are: 1) transferring 
to a host system, via a network such as the Internet, from at 
least one participating user system other than the host 
system, the data from the data base(s); 2) in the host system, 
analyzing the data from each data base according to an 
analysis routine and then generating analysis results; 3) in 
the host system, generating a representation of the analysis 
results; and 4) transferring the representation of the analysis 
results via the network for display on at least one partici- 
pating user system. 
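
The four steps above can be sketched as follows, with the network transfer replaced by in-process method calls. The class name `Host`, its method names, and the use of a simple mean as the analysis routine are illustrative assumptions and are not taken from the invention.

```python
from statistics import mean


class Host:
    """Hypothetical stand-in for the host system; no real networking."""

    def __init__(self):
        self.store = {}  # user name -> list of records received from that user

    def receive(self, user, records):
        # Step 1: data is transferred from a participating user system.
        self.store.setdefault(user, []).extend(records)

    def analyze(self, field):
        # Step 2: apply an analysis routine (here, assumed: count and mean).
        values = [rec[field] for recs in self.store.values() for rec in recs]
        return {"field": field, "count": len(values), "mean": mean(values)}

    def represent(self, result):
        # Step 3: generate a representation of the analysis results.
        return f"{result['field']}: n={result['count']}, mean={result['mean']:.1f}"

    def publish(self, result, recipients):
        # Step 4: send the representation to participating user systems.
        text = self.represent(result)
        return {user: text for user in recipients}


host = Host()
host.receive("alice", [{"age": 30}, {"age": 40}])
host.receive("bob", [{"age": 50}])
views = host.publish(host.analyze("age"), ["alice", "bob", "carol"])
```

In this sketch "carol" receives the representation without having contributed data, which mirrors the point that results may be shown on user systems other than the source system.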

In the preferred embodiment of the invention, a memory
region is allocated in the host system for each participating
user system. Each memory region stores data from each data
base transferred via the network from each respective
participating user system to the host system. Each memory
region may also store at least address information indicating
the location of the transferred data within the host system.
The address information may include, for example, a network
address of at least one external data base that is
accessible for downloading from a non-participating computer
system that is connected to the network. In this case,
each such external data base is accessed by the host system
via the network and then downloads the external data base
data into a memory of the host system. Even when the data
from the data base(s) is transferred from one participating
user source system, the representation of the analysis results
may be transferred to the participating user systems other
than the participating user source system.

The invention may operate with data base data stored or
arranged according to any known data structure. In the
preferred embodiment of the invention, however, the data
base data is structured into records, each record having one
or more fields. Each field contains field data, has a field
name and one of a plurality of data types. Given this data
structure, a decision support module in the host system
according to the invention then automatically selects an
initial, adjustable, graphical query device as a function of
and adapted to a type and range of the corresponding field
data. Each graphical query device is then transferred via the
network to at least one participating user system. The host
system then senses, via the network, adjustment by the user
of each participating user system to which each graphical
query device has been transferred of any of the displayed,
adjustable, graphical query devices. The host system then
updates the representation of the analysis results corresponding
to the sensed adjustments of any of the query devices,
thereby enabling interactive visualization of the analysis
results of the data via the network. At least one of the user
systems to which graphical query devices are transferred
may be one of the participating user systems other than the
source user system.

A log may be maintained, preferably in the user-associated,
allocated memory regions, of accesses to the data
stored in the respective memory regions. The host system
may then notify, via the network, each user whose corresponding
data, stored in the respective memory region, is
accessed by any other participating user.

The decision support/analysis module in the host system
may implement any known data analysis routine. In the case
where each data base contains a plurality of records and each
record includes a plurality of data fields, however, the
decision support module may analyze the data from the data
base(s) by automatically detecting a relational structure
between the data fields by calculating a respective relevance
measure for each of the data fields. The relevance measure
is preferably a data type-dependent function indicating a
measure of relational closeness to at least one other of the
data fields. The host system then generates a graphical
representation of the relational structure and transfers this
graphical representation via the network for display on at
least one participating user system.

Results of the data analysis may be generated and presented
in many different forms, such as on-screen
visualizations, reports, computations, etc. User systems then
communicate with the host system, preferably via a publicly
accessible network such as the Internet, or via a proprietary
network such as are found within some enterprises, in many
cases via a browser. Data stored not only in the user space,
but, optionally, even imported from external data bases
connected to the network, may then be analyzed in the
central host. Users may view the results of the analysis,
change parameters, and thus interactively analyze the data,
but may optionally do so collaboratively, and either in real
time, or asynchronously. Other users may add or remove
data from the analysis, or change the viewing parameters,
based on the same initial data set; the system then allows
them to explore other possible relationships in the data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates the main components of a
system according to the invention for retrieving and
displaying data from a data base.

FIG. 2 illustrates examples of device queries that can be
used in the invention's interactive display.

FIG. 3 illustrates the main processing steps the invention
follows for one example of the use of the invention to
visualize relationships between data in a data base.

FIG. 4 illustrates a decision tree, which is one method that
can be used in the analysis system according to the invention
in order to determine and define the structural relationship
between different data fields in a data base.

FIG. 5 illustrates a display of the results of a data base
analysis using the analysis system according to the invention.

FIG. 6 is a block diagram that illustrates a system according
to the invention in which the analysis system is remotely
accessible via a network so that it can be used to analyze data
in a data base at a user's site, or at a third-party site
accessible via the network, or both.

FIG. 7 is a block diagram of a remote-hosting embodiment
of the invention, in which the analysis system is
centrally hosted and remotely accessible via a network to
any number of users. Data to be analyzed is either stored
within a user memory space in the central host, or it may be
imported via network links, or both.

DETAILED DESCRIPTION

The invention is well suited for interactive visualization
and analysis of data from any type of data base. Just a few
of the thousands of possible uses of the invention are the
visualization and analysis of financial data, marketing data,
demographic data, experimental data, environmental data,
[...] data, World-Wide Web log files, manufacturing
data, biostatistics, geographic data, and telephone traffic/
usage data.

The invention includes a data analysis module or
"engine," and various embodiments, each with a different
system configuration in which the location of the data
analysis engine, of the various user systems, and of the data
to be used as the basis for the analysis differ. These are
described in turn. The preferred method and system for
automatically analyzing data are described first.

The main components of the simplest configuration of the
system according to the invention are illustrated in FIG. 1.
A data base or data set 100 (one or more) may be stored in
any conventional devices such as magnetic or optical disks
or tapes and semi-conductor memory devices. The size of
the data set may be arbitrary, as the invention has no inherent
limitations in the size of the data base it can access and
analyze.

A main processing system 110 may be implemented using
a microprocessor, a mini- or mainframe computer, or even a
plurality of such processors operating in parallel or as a
pipeline. The processing configuration may be chosen using


US 6,405,195 B1

normal design techniques based on the size of the largest
data set one expects to have to process, and on the required
processing times. The processing system 110 includes,
among other sub-systems, a sufficiently large memory 112 to
store all data used in the data classification and display
procedures described below. The processing system 110
forms the analysis system that enables a user to query one or
more data bases and view results of the data classification
according to the user's queries.

The data set 100 (that is, its storage device) is connected
for communication with the main processing system by
means of any conventional channel 114, which may be a
dedicated channel, or a general-purpose channel such as
telephone lines (for example, for connection through a
network, including the Internet), fiber-optic or high-bandwidth
metal cables, or radio or micro-wave links. The
size of the data set and the desired processing speed will in
general determine which channel is appropriate and fast
enough. The data set need not be remote from the processing
system although this is the case in whole or in part in some
embodiments of the invention, which are described and
illustrated below. Rather, the data set's storage device 100
may even be a peripheral memory device connected directly
to the processing system 110.

In most applications of the invention, the user will wish
to see a graphical display of some feature of the data set.
This is not, however, necessary: the invention may be used
as a sub-system that queries the data base 100 and organizes
the data for a supervisory routine, which then processes the
data automatically in some other way. For example, the
invention may be used in a system that automatically
generates lists of potential customers of a product chosen
from a large data base of consumer information. In the
typical case, however, the results of the data processing
using the invention are to be interactively displayed and to
that end, a display unit 120 is preferably connected to the
main processing system 110. The display unit may be a
standard device such as a computer monitor or other CRT,
LCD, plasma or other display screen. Standard display
drivers (not shown) are included in the display unit 120 and
are connected to the processing system 110 in any
conventional manner.

A conventional input system 130 is also connected to the
main processing system in the normal case in which the user
is to select initial data classification parameters. The input
system may consist of a single standard positional input
device such as a mouse or a trackball, or an alphanumeric
input device such as a keyboard, but will normally include
both types of devices. The display unit 120 itself may also
form part of the input device 130 by providing it with
standard touch-screen technology. The connection and
interface circuitry between the input system 130 and display unit
120 on the one hand and the processing system 110 on the
other hand may be implemented using standard components
and is therefore not described further here.

The main procedural steps carried out by the invention are
as follows:

1) The main processing system 110 accesses the data base
100 in any known manner and exchanges standard protocol
information. This information will normally include data
indicating the size of the stored data set as well as its record
and field structure.

2) The processing system then downloads records and
classifies them by type (also known as attribute). Some
examples of the many different possible types of data
include integers, floating-point numbers, alphanumeric
characters, and special characters or codes, lists and strings,
Boolean codes, times and dates, and so on. In a data base of
films, for example, each record may have data concerning
the title and the director's name (alphanumeric attribute), the
release year (an integer), whether the film is a comedy,
drama, documentary, action film, science fiction, etc.
(marked in the data base as an integer or alphanumeric
code), whether the film won an Academy Award (logical) or
how many it won (integer). As is described in greater detail
below, the system according to the invention preferably
automatically type-classifies the various fields in the data
base records. In certain cases, however, the data base itself
(in the initial protocol and structural information) may also
indicate the types of the various fields of the records; in such
case, the processing system may not need to type-classify
the fields and can omit this step.

3) For each record that the processor has classified, it
then (or simultaneously) determines the range of the data for
each field in the record. This can be done in any of several
standard ways, and different methods may be used for
different data types. For numerical data, the system may
simply search through the set to determine the maximum
and minimum values as well as (if needed), the average or
median values to aid in later centering and scaling of a
corresponding displayed query device. The system preferably
also counts the number of different values in each set
of fields in all of the records in the data set. Ranges may also
be predetermined; for example, if the user wishes to include
in the data base search data records sorted alphabetically by
surnames of Americans or Britons, then the range of first
letters will be no greater than A-Z (a range count of 26),
although a search of the actual records in the database might
show that the range of, say, A-W is sufficient (with a range
count of 23). For names (or other text) in other languages, the
alphabetical range and range count may be either greater or
smaller; for example, Swedish text could begin on any of 29
different letters (A-Z, Å, Ä, Ö).

4) The system then analyzes the relational structure of the
data records using any or all of a plurality of methods. These
methods include regression, decision trees, neural networks,
[...], and so on. According to the invention, the
system preferably applies more than one method to determine
the structure of the data and then either selects the
"best" method in some predetermined sense, or else it
presents the results of the different structural determinations
to the user, who then may then select one that appears to give
the best result.

5) Once the system has determined the data field types and
ranges, the system determines a user interface to be displayed
on the display unit 120. The results of the structural
relational analysis are also preferably used to order the
various query devices that are displayed to the user to give
him guidance in finding the strongest relationships among
the various fields of the data base. The interface preferably
automatically selects (at least initially; later, the system
automatically presents alternatives to the user, from which
he may select) the lay-out of query devices (described
below), coordinate axes (either automatically or under user
control) and scales, display colors and shapes, the degree of
"zoom" of the display (if needed), or other features depending
on the particular application and user preferences.

6) In many cases, there will be so many records in the data
base that it would take too long for the system to search
through all records in the data base to determine the record
type or range. The invention therefore preferably includes
the procedural steps of first determining the number of
records in the data base, and, if the number of records
exceeds a predetermined threshold, sampling the data set to
determine the record type and to extrapolate its range from
the range of the sample. Conventional techniques may be
used to determine the number of records and their type;
indeed, in many cases, the number of records is a parameter
included in the data base itself. The threshold for sampling
may also be determined using conventional design criteria
and will include such factors as the available time allowed
for data transfer and processing, which will in part be
determined by the speed of the chosen processor.

Different sampling techniques may be used. For example, 
every n'th record can be examined (where n is determined 
by what percentage of the records the system can examine 
in a given time); or a predetermined percentage of the 
records can be selected randomly; or records may be
sampled randomly until a predetermined statistical signifi- 
cance has been achieved, etc. Once the sampling process has 
been completed, the entire data set can be downloaded and 
processed for display using the type and range classifications 
of the sample.

7) In a remote processing embodiment of the invention, 
the data to be classified, analyzed, and displayed is located 
at a local user's site, or in a data base that is accessible via 
a network such as the Internet, or both, but the data is 
accessed and processed as above at a remote site. 
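
Steps 2) and 3) above (type classification and range determination) can be sketched as follows. This is a minimal illustration, assuming that field values arrive as text; the type labels, the `%Y-%m-%d` date format, and the function names are hypothetical and not specified by the invention.

```python
from datetime import datetime


def classify_value(text):
    """Return an assumed type label for one raw field value."""
    s = text.strip()
    if s.lower() in ("true", "false"):
        return "boolean"
    try:
        int(s)
        return "integer"
    except ValueError:
        pass
    try:
        float(s)
        return "float"
    except ValueError:
        pass
    try:
        datetime.strptime(s, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        return "alphanumeric"


def classify_field(values):
    """Pick one type for a whole field from the types of its values."""
    kinds = {classify_value(v) for v in values}
    if len(kinds) == 1:
        return kinds.pop()
    if kinds <= {"integer", "float"}:
        return "float"
    return "alphanumeric"  # mixed content falls back to text


def field_range(values, kind):
    """Return (lowest, highest, number of distinct values) for a field."""
    if kind == "integer":
        keys = [int(v) for v in values]
    elif kind == "float":
        keys = [float(v) for v in values]
    elif kind == "alphanumeric":
        keys = [v.strip()[:1].upper() for v in values]  # range of first letters
    else:
        keys = [v.strip() for v in values]
    return min(keys), max(keys), len(set(keys))


films = [
    {"title": "Alien", "year": "1979", "won_award": "true"},
    {"title": "Brazil", "year": "1985", "won_award": "false"},
    {"title": "Wings", "year": "1927", "won_award": "true"},
]
summary = {}
for name in films[0]:
    column = [rec[name] for rec in films]
    kind = classify_field(column)
    summary[name] = (kind, field_range(column, kind))
```

For the `title` field the detected range of first letters is A to W, which is the kind of information the text uses to scale a displayed query device.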

As FIG. 1 illustrates, the steps of type detection, range 
determination, structure identification, interface selection 
and sampling may be carried out in dedicated processing 
sub-systems 140, 142, 144, 146 and 148, respectively. Note
that many of the processing steps described above (for 
example, type/range determination and interface selection) 
can be carried out in parallel as well as in series. Assuming 
the chosen processing configuration is fast enough, however, 
all of these steps, or any combination of steps, may be
carried out by the same processor. 
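
The sampling step (sub-system 148) might look like the following sketch. It assumes the records fit in a Python list; the function names, the fixed random seed, and the use of the sample's own minimum and maximum as the range estimate are illustrative assumptions.

```python
import random


def sample_every_nth(records, n):
    """Examine every n'th record."""
    return records[::n]


def sample_random_fraction(records, fraction, seed=0):
    """Select a predetermined fraction of the records at random."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


def estimate_range(records, threshold, n):
    """Scan everything for small sets; otherwise scan every n'th record."""
    if len(records) <= threshold:
        subset = records
    else:
        subset = sample_every_nth(records, n)
    return min(subset), max(subset), len(subset)


values = list(range(1000))
low, high, examined = estimate_range(values, threshold=100, n=10)
picked = sample_random_fraction(values, 0.05)
```

The estimate is only as wide as the sample: here the true maximum is 999 but the sampled maximum is 990, so a real system would need to widen or later correct the range once the full data set has been processed.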

In most conventional systems, data base searches are 
interactive only in the sense that the user's initial "guess" 
(search profile) can be modified and re-submitted; the user
is given little or no guidance or indication of the size and
range of the data involved in the various possible choices for 
the search profile. As such, the user might, for example, 
initially submit a search profile with no possible "matches." 
By initially analyzing the data set to determine data types 
(attributes) and ranges, the invention is able to create an
initial query environment that allows the user to avoid such 
wastes of time. Even further time-saving procedures unique 
to the invention are described below. 

FIG. 2 illustrates examples of dynamic query devices that 
the processing system may generate on the screen of the
display unit 120, depending on the type and range of the 
data. The various data query devices are generated and 
located for display using any known software, such as is 
readily available for writing display software for Microsoft 
Windows applications or similar software packages. One of
the most useful query devices is the slider, which may be 
either a single-slider query device 200 for indicating single 
alphabetical or numerical characters, or the rangeslider 
query device 210 for indicating ranges of alphabetical or 
numerical characters; two-dimensional single and range
sliders may also be used. 

In FIG. 2, the attribute of the data field associated with the 
single slider 200 is alphabetical. One sees in this example
that the data in the indicated field has relatively very few
"A" entries, relatively many "B" entries, few or possibly no
"J" entries, many "S" entries, and so on. The user can also
see that there are no "X", "Y", or "Z" entries; the upper
alphabetical limit "W" will have been determined during the
range detection step. In the illustrated example, the user has 
manipulated a standard screen cursor 220 (for example, 
using a trackball or mouse included in the input system 130) 
to move the slider 230 approximately to the right-most range 
of the "B" entries. In other words, the user is requesting the 
system to find data records for which the corresponding data 
set records start with a "B". 
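
The information behind such an alphabetical slider can be sketched as below: a count of entries per first letter (which the slider can show as density), the upper limit found during range detection, and the selection the user makes by moving the slider to "B". The sample names and function names are hypothetical.

```python
from collections import Counter


def letter_histogram(names):
    """Count entries per first letter; drives the density shown on the slider."""
    return Counter(name[:1].upper() for name in names if name)


def select_by_initial(names, letter):
    """Return the entries that start with the letter under the slider."""
    return [name for name in names if name[:1].upper() == letter.upper()]


names = ["Adams", "Baker", "Brown", "Byrne", "Smith", "Stone", "Watts"]
hist = letter_histogram(names)
upper_limit = max(hist)          # highest first letter actually present
selected = select_by_initial(names, "B")
```

Letters with a zero count (for example "J" here) can be drawn with no density, so the user sees before querying that a search on them would return nothing.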

The "scale" of the alphaslider need not be alphabetical; 
rather, instead of the letters "A", "B", . . . , "W" and so on, 
the system could display numbers, one of which the user is 
to select by "touching" and moving the slider 230. The user 
may move the slider 230, for example, by placing the tip of 
the cursor 220 on it and holding down a standard mouse 
button while moving the mouse to the left or right, releasing 
the button when the slider is at the desired value. 

The illustrated range slider query device 210 has a scale 
the same as the single-valued slider, but, as its name 
indicates, is used to select a range of values. To do so, the 
user "touches" either the left range slider 240 or the right 
range slider 242 and moves it as with the slider 230. In the 
illustrated example, the user has selected a query such that 
only those relations should be displayed for which the 
chosen attribute has a value between about 13 and about 72. 
Excluded values are here displayed "shaded" on the slider 
query device. 
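
A range slider of this kind reduces to two handle positions, kept inside the detected range of the field, and a filter on the records. The sketch below assumes inclusive bounds at both ends (the text says only "between about 13 and about 72"); the function and field names are hypothetical.

```python
def clamp_handles(field_min, field_max, low, high):
    """Keep both handles inside the detected range and in order."""
    low = max(field_min, min(low, field_max))
    high = min(field_max, max(high, field_min))
    if low > high:
        low, high = high, low
    return low, high


def apply_range(records, field, low, high):
    """Split records into those inside the selected range and those excluded."""
    inside = [rec for rec in records if low <= rec[field] <= high]
    outside = [rec for rec in records if not (low <= rec[field] <= high)]
    return inside, outside


records = [{"age": a} for a in (5, 13, 40, 72, 90)]
low, high = clamp_handles(0, 100, 13, 72)
inside, outside = apply_range(records, "age", low, high)
selected_ages = [rec["age"] for rec in inside]
shaded_ages = [rec["age"] for rec in outside]   # values a display would shade
```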

Many variations of the illustrated sliders may be used in 
the invention, such as those that indicate which values are 
not to be included (for example, by "clicking" on an 
appropriate portion of the slider display to indicate by 
shading that the logical complement of that portion of the 
range is to be applied), that indicate ranges inclusive at one 
extreme but exclusive at the other (for example, by clicking 
on the range marker to toggle it to different logical states), 
and so on. A more complete discussion of the possibilities is 
given in the inventors' papers "Exploring Terra Incognita in 
the Design Space of Query Devices," C. Ahlberg & S. Truve,
Dept. of Computer Science and SSKKII, Chalmers University
of Technology, Goteborg, Sweden; the article "The
Alphaslider: A Compact and Rapid Selector," C. Ahlberg &
B. Shneiderman, Proceedings, ACM SIGCHI '94, Apr.
24-28, 1994; and "Dynamic Queries for Information Explo-
ration: An Implementation and Evaluation," ACM SIGCHI
'92, May 3-7, 1992.

The illustrated example also shows a toggle 250 on which 
the user has "clicked" (for example, in the standard way, by 
"touching" the toggle with the cursor on the display screen 
and pressing a mouse button) to indicate that the feature "Y" 
should be present in the displayed data. That the toggle is 
"on" may, for example, be indicated by the processing 
system by displaying it darker, by superimposing a cross 
("X") on it, or in some other conventional way. 

A checkbox 260 contains more than one toggle. In the 
illustrated example, features B and C have been selected for 
inclusion as a data query, whereas A and B have not. A 
displayed dial 270 is yet another example of a query device. 
Using the cursor, the user pulls the pointer 272 clockwise or 
counter-clockwise and the system displays the value (in the 
example, "73") to which the pointer is currently pointing. 
Other query devices may be used, for example, pull-down
menus and two-dimensional sliders (for example, one on an 
x-axis and another on a y-axis). 
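
One way to combine several query devices is to let each device produce a predicate on a record and to take the conjunction of all predicates as the query. This is a sketch under that assumption; the text does not state how devices are combined, and the function names and film fields are hypothetical.

```python
def toggle(field):
    """Toggle switched on: the feature must be present (truthy)."""
    return lambda rec: bool(rec.get(field))


def checkbox(field, checked):
    """Checkbox group: the field value must be one of the checked options."""
    return lambda rec: rec.get(field) in checked


def dial(field, value):
    """Dial: the field must equal the value the pointer indicates."""
    return lambda rec: rec.get(field) == value


def run_query(records, devices):
    """Keep the records that satisfy every active query device."""
    return [rec for rec in records if all(dev(rec) for dev in devices)]


films = [
    {"title": "P", "genre": "drama", "won_award": True, "score": 73},
    {"title": "Q", "genre": "comedy", "won_award": True, "score": 73},
    {"title": "R", "genre": "drama", "won_award": False, "score": 73},
    {"title": "S", "genre": "action", "won_award": True, "score": 50},
]
devices = [
    toggle("won_award"),
    checkbox("genre", {"drama", "comedy"}),
    dial("score", 73),
]
hits = [rec["title"] for rec in run_query(films, devices)]
```

Because every device is just a predicate, adjusting one device only requires rebuilding that predicate and re-running the filter.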

FIG. 3 is a block diagram that shows the main paths of data
flow in the invention and also serves as a more detailed
functional block diagram of the system shown in FIG. 1.
Reference numbers for the functional blocks are the
same as those in FIG. 1. Furthermore, the sampling
sub-system 148 has been omitted from FIG. 3 since its operation
is described above and since it will simply reduce the 
number of data records initially passed on to the type 
detection, range detection and possibly the structure detec- 
tion sub-systems described below. 

As is mentioned above, the invention can be used for 
many different types of data bases and data base structures. 
Merely by way of example, however, assume that the user 
wishes to analyze the possible relationships found between 
various items in a data base of customer purchases for a 
chain of stores that sell clothing, shoes, and cosmetics. Such 
data might be compiled automatically, for example, for all 
customers who use the store's own credit card. This situation 
is illustrated in FIG. 3. 

As is common, the data base 100 is organized as a series
of records R1, R2, . . . , Rm, each of which has a number of
fields F1, F2, . . . , Fn. In this example, there are ten fields
per record (here, n=10, but the number of fields per record in
actual data bases may of course be greater or less than
ten; the invention is not dependent on any particular
number of records or fields). The fields (F1, F2, . . . , F10) in the
example are: F1) an identification code for the customer
associated with the record; F2), F3) and F4) the customer's
name, age and sex, respectively; F5) the total amount the
customer has spent (during some predetermined period);
F6), F7), and F8) the amount the customer has spent on
clothing, shoes and cosmetics, respectively, during this period;
F9) the date of the customer's most recent purchase; and
F10) a number representing how frequently the customer
makes purchases (for example, measured in transactions per
month).
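
As a minimal sketch of the example record layout, the ten fields could be modeled as follows. The class name, attribute names, and sample values are illustrative assumptions and do not come from the specification; only the field order F1 to F10 follows the description above.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class CustomerRecord:
    """One record Rj of the example data base (names are hypothetical)."""
    customer_id: str            # F1: identification code
    name: str                   # F2: customer's name
    age: int                    # F3: customer's age
    sex: str                    # F4: customer's sex
    total_spent: float          # F5: total amount spent in the period
    category1_spent: float      # F6: amount spent on the first product category
    category2_spent: float      # F7: amount spent on the second product category
    category3_spent: float      # F8: amount spent on the third product category
    last_purchase: date         # F9: date of most recent purchase
    purchases_per_month: float  # F10: purchase frequency


# Invented sample values, used only to show the shape of a record.
record = CustomerRecord("C0001", "Smith", 42, "F", 2630.0,
                        1500.0, 800.0, 330.0, date(1997, 4, 30), 7.5)

# The field names play the role of the "field name data" in the data base.
field_names = [f.name for f in fields(CustomerRecord)]
```

In practice the field names and types would come from the data base's own protocol and field name data rather than from a fixed class.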

The illustrated data base also includes standard protocol 
data as well as field names associated with the different 
fields. The protocol data will typically include data indicat- 
ing the total number of bytes (or words) the data base 
contains, how many records, how many fields per record, 
and how many bytes (or words) each field consists of. If the 
protocol is already standardized or otherwise pre- 
determined between the data base 100 and the main pro- 
cessing system, then there will be no need for the protocol 
fields. Moreover, the field name data will not be necessary 
if it is already established in some other conventional 
manner for the user or the main processing system of the 
invention what data the various record fields represent. 

In the preferred embodiment of the invention, the main 
processing system 110 automatically detects the type of data 
in each of the record fields, unless the data types are already 
specified by the data base in the protocol or field names data. 
This may be accomplished using any known data type 
detection routine, as long as the number of records in the 
data base is large enough to allow the detection routine to 
make statistically relevant deductions about the data. For
example, in order to detect the type of data in field Fk (k=1,
2, . . . , n), the processing system may access (that is,
download in bulk, read in and process sequentially, etc.) all
of the field data (R1, Fk), (R2, Fk), . . . , (Rm, Fk), where
(Rj, Fk) indicates the k'th field of the j'th record. Any of many
different known tests may then be applied to determine the
data type.

For example, if all (or more than a pre-defined
percentage) of the bytes of all of the fields Fi (that is, field
Fi in all of the records) correspond to binary numbers from
65-90 and 97-122, then the system may assume the field
contains data with an alphabetical (string) attribute (type),
since these are the ranges of the ASCII codes for the
English-language alphabet (upper and lower case,
respectively). In the example shown in FIG. 3, this would be
the case for F2: Name. If only two different values are
detected (especially, 00000000 and 00000001), then the
system may, for example, assume that the corresponding
field contains Boolean data or, if the two values also fall
within the ASCII alphabetical range, then the system may
instead (or, temporarily, in addition) mark the field as an
alphabetical field. Field F4 might thus be either Boolean (if
the field name is "Woman?" then F4 might indicate either
"yes" or "no" with binary numbers 1 and 0) or a
single-element alphabetical string ("F" for female, "M" for male).
Using known methods, the system will similarly distinguish
between integers and floating-point numbers, often by a
knowledge of the field structure itself from the protocol
data: integers are typically represented by single data
words, whereas floating-point numbers will typically require
two separate data words for the whole-number and decimal
portions. Indications of the data types are then preferably
stored as a data type table 340 in the memory 112. In FIG. 3,
field Fk has been identified as having data type Tk
(k=1, 2, . . . , n).
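
The byte-range test described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the 95% default threshold, and the "unknown" fallback are hypothetical, and a real detector would add the integer and floating-point tests mentioned in the text.

```python
def is_ascii_letter(byte):
    """True for ASCII codes 65-90 (upper case) and 97-122 (lower case)."""
    return 65 <= byte <= 90 or 97 <= byte <= 122


def detect_field_type(values, letter_fraction=0.95):
    """Guess the type of one field from its raw values.

    values: list of bytes objects, one per record, all for the same field.
    letter_fraction: assumed "pre-defined percentage" of letter bytes.
    """
    total_bytes = sum(len(v) for v in values)
    letter_bytes = sum(1 for v in values for b in v if is_ascii_letter(b))
    alphabetical = total_bytes > 0 and letter_bytes / total_bytes >= letter_fraction

    if len(set(values)) == 2:
        # Two distinct values suggest Boolean data, unless both values
        # fall within the alphabetical range (for example "F" and "M").
        return "string" if alphabetical else "boolean"
    if alphabetical:
        return "string"
    return "unknown"  # numeric tests would go here


# Results would be collected per field, in the spirit of data type table 340.
type_table = {
    "Name": detect_field_type([b"Smith", b"Jones", b"Lee"]),
    "Woman?": detect_field_type([b"\x00", b"\x01", b"\x01"]),
    "Sex": detect_field_type([b"F", b"M", b"F"]),
}
```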

For each field, the range detection sub-system 142
determines upper and lower limits. For numerical fields, for
example, this will typically be the maximum and minimum
values. For string data, however, this will typically be the
letters closest to either end of the alphabet. The number of
different values is preferably also accumulated for each field.
Additional data may also be tabulated as desired or needed.
For each string field, for example, the system
might accumulate a separate table of the number of times
each letter of the alphabet occurs first in the field in order to
reduce clutter in the later display by eliminating
non-occurring letters. The median of the occurrence table may
then be calculated and used for later centering of the scale
of the associated query device (see below). For numerical
fields, the range detector 142 may additionally calculate
such statistical range data as the mean, median, and standard
deviation of the field data. All calculated range data is then
preferably stored in a data range table 342 in the memory
112. In FIG. 3, field Fk has been identified as having range
data Mk (k=1, 2, . . . , n).
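
A minimal sketch of the range data gathered for one field is shown below. The function names and dictionary keys are assumptions made for illustration, and taking the "median of the occurrence table" to mean the middle entry of the sorted first letters is only one possible reading of the text. Both functions assume a non-empty field.

```python
import statistics
from collections import Counter


def numeric_range(values):
    """Range data for a numerical field (assumes at least one value)."""
    return {
        "min": min(values),
        "max": max(values),
        "distinct": len(set(values)),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


def string_range(values):
    """Range data for a string field, based on first letters."""
    firsts = sorted(v[0].upper() for v in values if v)
    return {
        "low": firsts[0],                        # letter closest to "A"
        "high": firsts[-1],                      # letter closest to "Z"
        "distinct": len(set(values)),
        "first_letter_counts": dict(Counter(firsts)),
        "median_letter": firsts[len(firsts) // 2],
    }


# Hypothetical entries of a data range table such as table 342.
age_range = numeric_range([20, 35, 42, 55])
name_range = string_range(["Smith", "adams", "Lee", "Sato"])
```

Letters that never occur first are simply absent from `first_letter_counts`, which is what allows a later display to omit them.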

As is mentioned above, the type- and range-detection
sub-systems 140, 142 may operate either in series or in
parallel. Even with a single processor implementing both
sub-systems, these two sub-systems may operate
"simultaneously" in the sense that each operates on a single data
value before the next is processed, in order to reduce
processing time by having only a single download of the
data. For example, the range detector 142 may use each
accessed data word as soon as the type detector is finished
with it and then use it in the on-going, cumulative
calculations of minima, maxima, means, and all other range data for
the corresponding field. Once the type detector determines
the data type for each field, the range detector can then
discard range data calculations that are inappropriate to the
detected type. For example, in general there will be no need
for a calculation of the median or mean of Boolean data or
of entire strings (although, as is mentioned above, there may
be for first letters).
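
The single-pass idea can be sketched as follows, assuming records arrive as tuples with one value per field and that each field holds values of a single type. The class and function names are hypothetical; the point is only that each value is read once and that statistics unsuited to the detected type are dropped at the end.

```python
class FieldAccumulator:
    """Running range data for one field, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.minimum = None
        self.maximum = None
        self.total = 0.0
        self.numeric = True       # cleared as soon as a non-number is seen
        self.distinct = set()

    def update(self, value):
        self.count += 1
        self.distinct.add(value)
        if self.minimum is None or value < self.minimum:
            self.minimum = value
        if self.maximum is None or value > self.maximum:
            self.maximum = value
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if self.numeric and is_number:
            self.total += value
        else:
            self.numeric = False

    def summary(self):
        result = {"min": self.minimum, "max": self.maximum,
                  "distinct": len(self.distinct)}
        if self.numeric and self.count:
            # The mean is kept only where it suits the detected type.
            result["mean"] = self.total / self.count
        return result


def scan(records):
    """Read every record once, updating all field accumulators."""
    accumulators = None
    for record in records:
        if accumulators is None:
            accumulators = [FieldAccumulator() for _ in record]
        for accumulator, value in zip(accumulators, record):
            accumulator.update(value)
    return [a.summary() for a in (accumulators or [])]


rows = [("Smith", 42, True), ("Lee", 30, False), ("Adams", 54, True)]
range_data = scan(rows)
```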

Once the type and range of the fields have been
determined, the system then automatically determines
various relationships between the different data fields.
Preferably, several different methods are used, from which
the system initially selects a "best" method in a
predetermined sense, and also orders query devices in such a way on
the display that they indicate to the user which relationships
are the strongest. In some applications, the user knows
which data characteristic (data field) the relationship
determination is to be based on. For example, the user might
wonder which type of purchase (clothing, shoes, or
cosmetics) seems to be most highly dependent on the age of
the customers. In such cases, the user will indicate this to the
processing system by entering the independent variable (in
the assumed example, "age") via the input system 130.
(The processing system may, for example, display a list of
the field names on the display, from which the user can select
in any standard way.) The preferred embodiment of the
invention is not restricted, however, to beforehand
knowledge of which data field is to be the independent variable in
order to determine the structure of the data set, although this
will in general reduce processing time and memory
requirements.

One method of determining the structure of the data set is
statistical correlation, either directly, using standard
formulas, or in conjunction with determining the regression
(especially linear) parameters for any two selected fields of
data. For each possible pair of different data fields, the
system calculates the statistical correlation and stores the
resulting correlation values in a correlation matrix in the
memory. The system then identifies the maximum
correlation value for each field taken in turn as the independent
variable, and orders the remaining variables in order of
decreasing correlation. Note that statistical correlation will
in general be a meaningful measure only of the relationship
between sets of quantitatively ordered data such as
numerical field data.

Moreover, if the user indicates which of the n variables
(that is, which of the n fields) is to be used as the
independent variable before the system begins correlation
calculations, then the system need only calculate and order
the resulting (n-1) correlation values. If the user does not
indicate the independent variable for this or any other
structure-determining method, then the system may simply
assume each variable to be independent in turn and then
calculate correlations with all others; the greatest correlation
found can then be presented initially to the user.

Another method for determining structure is the decision
tree, which can be constructed using known methods. See,
for example, Data warehousing: strategies, technologies
and techniques, Ron Mattison, McGraw Hill, 1996. As an
example, consider FIG. 4, and assume that the independent
variable of interest to the user is F5, that is, total spending.
In the illustrated example, the structure sub-system
determines that 30% of the data records are for men and 70%
correspond to women. Note that this data will preferably
already be available in the range data table under entries for
number of occurrences of each state of each field. Note also
that variable values may be defined as intervals, not only as
individual values; thus, solely for the sake of simplicity of
explanation, in the illustrated example, frequency data (field
F10) is given as one of three intervals: 0-5, 6-10, and more
than ten purchases per time period, both for males and for
females. For each frequency and for each sex, the data is
further branched into age intervals: under 20, 21-35, 36-55,
and over 55. (The decision tree will normally continue to
branch further in order to include the possibilities for the
other fields, but these have been deleted in order to simplify
the discussion, with no loss of generality.) The total average
spending for each branch is indicated at the tip of the
lowermost branch (the independent variable). For example,
the total average spending of the group of 36-55 year-old
men who purchase 6-10 times per time period is $704.

Given the illustrated tree's ordering of branches, one can
see that the highest level of average total spending is for
36-55 year-old women who purchase 6-10 times per period
($2630). By summing "upward" all branches at each level,
the system can determine the total average purchases of all
men/women whose frequency is 6-10, then for all
men/women, and then, by traversing the tree "downward," the
system can pick the path (order of variables) that gives the
greatest total average spending, the next greatest, and so on.
Note that the decision tree structure is not limited to
numerically ordered fields.

In another, straightforward structural description of the
data base, the system compiles and inspects the distribution
of distinct values (or number of values in a plurality of
distinct intervals). This may be done either independently or
in conjunction with the construction of, for example, a
decision tree.

Yet another way to determine the structure of the data base
is by using a neural network. The theory and construction of
neural networks is well documented and understood and is
therefore not discussed in detail here. Of note, however, is
that neural networks must in general be "trained" to stabilize
on known data sets before they can be used on "actual" data
sets. In the context of this invention, the use of a neural
network will thus normally depend on beforehand
knowledge of at least the type of data in the data base, so that
a suitable set of training data can be compiled and used to
train the network. Here, "suitable" means of the same
general type, distribution, and with the same general data
relationships as those actually presented to the processing
system. In many cases this will not be possible;
in others, however, it often is, for example, where the data
base is numerical and dependent on an underlying set of
substantially constant rules or natural laws such as
meteorological data.

Assuming some other method is first applied to determine
membership functions for the different variables (data
fields), fuzzy logic techniques may also be used to measure
the strength of relationships among pairs or groups of
variables.

Other structure-determining methods include predictive
rule-based techniques, which are described, along with still
other methods, in Data warehousing: strategies, technologies
and techniques, Ron Mattison, McGraw Hill, 1996.

Each different method for determining the structure of the
data corresponds to a particular measure of what is meant by
the "closeness" or "strength" of the relationship between
two or more data fields. In many cases, only one of the
different structure-determining methods in the sub-system
144 will be suitable for the detected data types. For example,
statistical correlation may be the most suitable method if all
of the data fields correspond to numerical data, whereas
decision trees will normally be more efficient for ordering
fields of strings or Boolean data. "Suitable" and "efficient"
may be defined and calculated in any predetermined, known
sense to determine a validity value indicating the validity of
the corresponding measure. Furthermore, in many cases the
methods themselves will reveal their own unsuitability. For
example, if almost all data field pairs have statistical
correlation near zero, then a different method, such as a decision
tree, is almost certainly indicated.

Common to all the structure-determining techniques
applied by the structure sub-system is that the sub-system
determines a measure of relevance for each data field. In
some cases, the relevance measure for a given field may be
wholly independent of other fields. For example, one
straightforward measure of relevance might be a count of
how many fields have a certain value, or how many distinct
values the field holds. This might be very relevant, for
example, in evaluating the sales of some particular product,
regardless of other sales.

In other cases, the measure of relevance may be a measure
of dependence of some set of dependent, secondary
variables (fields) on some base, independent variable (field)
selected either automatically or by user input. One method
for automatic selection would be to use as the independent
field the same field ultimately selected by the user during a
previous evaluation of the same or a similar data base or
through user input. Another automatic method would be for
the system to be connected to an existing expert system,
which then selects the independent field. Yet another
automatic method would be for the system to determine all
possible pairs (or some predetermined or heuristically
determined number of pairs) of fields, then evaluate the
relevance measure for each pair, and then order all the
results for user evaluation and selection. Statistical
correlation (alone, or in conjunction with a linear regression or
other curve-fitting routine) is one example of a measure of
relevance that is based on a measure of dependence.

Once the system has calculated the relevance measure for
each of the fields, then it preferably presents the results to the
user by displaying the corresponding field names (or some
other identifier) in order (for example, decreasing) of their
relevance measures. Where the relevance measure involves
dependence of secondary fields on a chosen base field, then
the system preferably displays an indication of which field
is the base field and in what order the other fields depend on
it. The dependent variables are, for example, ordered in
terms of decreasing dependence so that the user is given
guidance as to which relationships may be of greatest
interest. (As is described below, the user can change the
order of presentation and the plotted,
relationship-visualizing display.)

At any time after the system has determined the type and
ranges of the various data fields, the system proceeds with
query device selection. Consider once again FIG. 2. In the
preferred embodiment of the invention, the initially
presented query device will depend primarily on how many
different possible values a data field can assume. The
thresholds for selecting the different query devices will be
predetermined and pre-programmed into the system, but can
be changed under user control after initial presentation (for
example, by activating an icon of the desired query device
and then "dragging" it to the currently displayed query
device, by activating and selecting from a pull-down menu
adjacent to the currently displayed device, or by using any
other known technique for changing portions of a graphical
user interface). For example, if the data type is Boolean, a
toggle may be predetermined to be the initial query device
selected. For string and/or numerical data with fewer than,
for example, seven different values, a checkbox or
pull-down menu may be the default query device. For fields
(variables) with more than some predetermined threshold
number of different values, however, the default query
device may be a single or range slider, depending on the data
type.

More detailed discussion of query device selection is
disclosed in the inventors' article (also mentioned above)
"Exploring Terra Incognita in the Design Space of Query
Devices," C. Ahlberg & S. Truve, Dept. of Computer Science
and SSKKII, Chalmers University of Technology, Goteborg,
Sweden, which also discusses the scaling of sliders as a
function of their range. For example, the lower limit of the
data values may be placed at the left end of the slider scale,
the upper limit at the right end, the different gradations or
value marks (A B C D . . . E K L M N P R S W) can be displayed
adjacent to the slider, and centered on the previously
calculated median or average value.

The chosen query devices for the different data field
variables are then preferably displayed on the display in the
order of dependence on the chosen independent variable.
The order used is preferably, at least initially, as
determined by the structure-detecting method that calculated the
most significant dependence relationship in any pre-defined
sense, that is, has the greatest validity value. For example, an
indication of the name of the independent variable (that is,
its field name) may be displayed in some prominent position
on the screen and the other query devices are then
preferably positioned top-to-bottom, left-to-right, or in some
other intuitive way so as to indicate decreasing measured
dependence on the independent variable.

Once the query devices are sorted and displayed, the
system preferably also displays an initial plot (for example,
X-Y, pie chart, bar graph, etc.) of the relationship. The initial
type of plot, its scaling, color scheme, marker type, size, and
other features (in short, the view selection) are preferably
selected in any conventional manner.

FIG. 5 illustrates a simplified display screen
corresponding to one possible set of data processed by the invention
using the earlier example of a data base of sales statistics. In
the example, the system has determined that the strongest
relationship, given the independent variable age, is with
purchase frequency, followed by sex, then recency, and so
on, which is indicated to the user by displaying the
corresponding query devices vertically in descending order. In the
example, rangesliders were indicated and automatically
selected for the fields "Frequency" and "Recency", whereas
toggles were chosen for each of "Male" and "Female" since
they can be plotted non-exclusively using different data
markers, for example "A" and "O". The structure detection
method (measure) with the best validity value is displayed in
display region 500 as the decision tree ("TREE"). The
default plot type is shown in region 502 as an XY plot. By
activating, for example, conventional pull-down menus such
as 530, 540 and 550, the user may direct the system to
change the query device for any given field, the measure to
be used to determine the order of dependency of the
dependent variables (data fields), or, for the plotting, the plot type or
color scheme, etc.

Using a pull-down menu, the user has selected "AGE" as
the independent variable, and, using a different pull-down
menu 506, indicated to the system that TOTAL
PURCHASES should be plotted against AGE. Rangesliders 508,
510 are preferably displayed on the respective x- and y-axes
to allow the user to adjust (by moving the range markers
with the cursor 220) the displayed ranges. In the illustrated
example, the system plots only the data for which the
frequency lies in the range 2-25 and the recency lies in the range
0-24, since the user has moved the range markers of the
respective range sliders accordingly.

Using known techniques, the system continually senses
the state of all toggles, range and alphasliders, etc. and
whenever a change is detected, it re-plots the selected
relationship to include only the desired data characteristics.
For example, if the user were to "click" on the toggle for
"Male" ("M"), so that it is de-selected, then the system
would remove the "A-marked" data points on the plot 520.
As the user changes the settings of other query devices, the
system updates the display accordingly to include only the
field data that falls within the indicated ranges. This allows
the user to view and change the data base presentation
interactively, so that there is no break in concentration and
exploration of the data base for time-consuming
re-submission of conventional queries.

Administrative information such as data set file name, the
number of records, and the date and time may also be
included on the display screen as desired and as space
permits.

Remote Processing Embodiment

According to a further embodiment of the invention, the
input unit 130 and the display 120 are located at a user's site
that is remote from the processing system 110 itself. The
data base (or several data bases, depending on the
application) itself is located either at the user's site, or at one
or more third-party sites that are accessible to the processing
system. In other words, the data to be analyzed is located
locally (that is, at the user's site or at a third-party site
designated by the user, or both), as are the devices needed to
submit queries and view the results, but the actual
processing is carried out remotely, and is accessed via a network.
This allows a user who does not locally have the data
analysis capability provided by the processing system
according to the invention to still have the benefit of it. The
general configuration of this embodiment of the invention is
illustrated in FIG. 6.

An enterprise 600 (any number of which may be included
in the invention) is a local system, that is, located at a user's
site, that includes at least one (and often many) local
processor or processing system 610, typically a network
server, one or more data bases DB1, . . . , DBn, conventional
browser software whose results can be viewed on a
conventional display 620, and a conventional input device 630. The
display 620, which is preferably controlled by a browser or
similar software, and input device 630 in this embodiment
correspond to and may in fact replace the display and input
devices 120, 130 shown in FIG. 1.

In the generalized embodiments of the invention
described above (see FIG. 1), the processing system 110,
which forms the data analysis system, is connected to one or
more data bases 100 via the channel 114. In this remote
processing embodiment of the invention, the channel is a
conventional data network 614. The network may be internal
to the enterprise, such as a standard local area or proprietary
network, for example, connecting many different sites of a
large corporation. This embodiment of the invention is,
however, most useful when the network 614 is a wide-area,
publicly accessible network such as the Internet, since it then
allows not only for the widest range of users to take
advantage of the data categorization, analysis, and
visualization capabilities of the processing system according to the
invention, but also makes it possible for a user to access,
analyze, and visualize data in any data base accessible
through the network, that is, even in third-party data bases,
as long as they are accessible via the network.

In many corporate enterprise systems, a so-called
"firewall" 640 is implemented to isolate the hardware and
software components of the system from the public network
614 (such as the Internet) in order to protect the system
against corruption (such as from viruses) and intrusion (such
as from "hackers") from outside sources also connected to
the network. It is usually desirable to allow at least some
contact with other entities via the network 614. One way to
ensure this without loss of security is to include a
well-controlled and monitored portal 645 through the firewall.
This connection may be made even more secure by
implementing any standard or agreed-upon encryption scheme
(for example, any of the widely used "public key
infrastructure" PKI schemes) for data transfer to and from the
enterprise 600. Other systems communicating with the enterprise
600 will then include similar portals and transfer data using
the same encryption standard. The data communication will
then be secure and private, even though it is taking place
over a (preferably) public network 614. This arrangement is
known widely as a "virtual private network" (VPN). This
invention also works well with multiple, collaborating sites
or entities, any or all of which may exist behind respective
firewalls and communicate via the public network, usually
via a secure VPN arrangement. Such cooperating,
collaborating entities are widely known as an extended or virtual
enterprise.

Other enterprises, such as most individuals and small
organizations, do not have a firewall, but rather allow direct
connections between the individual computers within the
organization and the network 614, or between an internal,
local network and the public network 614, or both. This
remote processing embodiment of the invention will work with
either arrangement, and the term "enterprise" is used here to
denote any such user, whether an individual computer
system, or an entity with or without an internal network of
several connected computers communicating either
independently with the external network 614 or only through a
common server, and either with or without firewall
protection and with or without a hardware and/or software
component providing VPN capability, including extended or
virtual enterprises.

The only requirement is that at least one local processing
system 610 should be able to connect to the network 614,
either directly (using any known technology such as dial-up
connections, DSL, satellite, etc.) or indirectly, via one or
more intermediary servers, such as when using a corporate
server and/or a third-party Internet Service Provider (ISP),
and should allow data transfer (downloading) via the
network.

There are many known techniques by which processors
transfer data over a network such as the Internet and either
allow access to data bases or transfer the content of such
data bases via the network to other systems such as the
processing system 110. In the Internet context, several
protocols, such as the File Transfer Protocol (FTP), are
standard and well known. Similarly, the techniques used for
transferring display-control data, and for inputting and
uploading parameters via a text-based or graphical user
interface are very well known to any user of, for example,
the Internet. For example, HTML (hyper-text markup
language), XML (extensible markup language), and Java are
commonly used in transfers in order to generate displays,
photos, text, and so on, on a computer display connected to
the Internet. Any such conventional transfer protocols and
languages may be used to implement the various data
transfers and display generation in this invention.

In this remote processing embodiment of the invention,
the processing system 110 (FIGS. 1 and 3) is hosted
remotely, that is, separate from the enterprise 600. In this
context, "separate" means that the enterprise 600 is
connected to the data analysis system 110 only via the network
614.

As is indicated in FIG. 6, the analysis (data processing)
system 110 in this remote processing embodiment of the
invention does not require a separate input and display
system as shown in FIG. 1, but these devices may of course
also be included as needed. The analysis system may itself
be provided with a conventional VPN module 655 that
corresponds to the portal 645 of the enterprise system.
Communication over the network (indicated by the dashed
portion of the line connecting VPN 645 and VPN 655) can
thus be made secure, using any conventional encryption 
routine for data transfer between the enterprise system 600 
(in particular, the local processor 610) and the remote, 
hosted data analysis system 110. 

In FIG. 6, the module labeled decision support 660 is used 
to indicate, collectively, the various modules for data typing 
140, range determination 142, structural analysis 144, inter- 
face selection 146 and, if needed, sampling 148 as shown in 
FIGS. 1 and 3, as well as the memory 112. In effect, in this 
remote processing embodiment of the invention, the main 
processing system 110 remains at a host site, but the data 
base(s) 100, user input 130 and display 120 functions are 
localized to the user's site, the connection between the 
processing system and the data to be processed being the 
network 614. 

It is not necessary for all the data designated for analysis, 
classification, visualization and display to be located within 
the enterprise 600. Rather, the invention may also be used to 
process data in one or more publicly available data bases, 
one of which is indicated in FIG. 6 as the data base DBP. 
Note that, in this case, data transfer will in general not be 
secure, as is illustrated by the direct connection between the 
decision support module 660 and the network 614. Such data 
transfer can be carried out using any of the well-known, 
widely used techniques now available for network file and 
data transfer. 

Any known mechanism may be used to identify the 
various systems connected to the network 614. Common 
identifiers for Internet sites include Uniform Resource
Locators (URLs) and Data Source Names (DSNs). The
methods by which one network entity (such as the enterprise
600) addresses another (such as the data analysis system
110) and requests and transfers data are well known. Any such
method may be used according to the invention. 

Assume that a user at the enterprise 600 wants to analyze 
and visualize data in either internal data bases DB1-DBn, or
an external data base DBP that is addressable via the 
network, or both. The user first submits a request to the 
network (using conventional methods), for example, using 
the browser 620 or using some other known software 
directly through the local processor 610, for access to the 
data analysis system 110. This may be done, for example, by 
submitting the URL of the system 110. Once the connection
has been established, the data analysis system 110 transmits to
the user's browser 620 a standard analysis request screen.
The user then identifies for the analysis system 110 the
network address(es) or similar parameters enabling the data
analysis system 110 to address and access the data bases 
whose data the user wishes to have visualized. The data 
analysis system 110 then carries out the procedural steps of 
data analysis described above: 1) accessing the user- 
identified data base(s) and exchanging standard protocol 
information (including encryption parameters to establish a 
VPN, if implemented); 2) downloading records and classi- 
fying them by type; 3) determining the range of the data for 
each field in the set; 4) analyzing the relational structure of 
the data records; 5) determining a user interface to be 
displayed on the user's browser 620; and, if needed, 6) 
sampling the data. 
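Merely by way of illustration, steps 2, 3 and 5 above can be sketched as follows. This is a minimal sketch and not the routine of the invention: the function names, the two-way numeric/categorical typing, and the mapping of numeric fields to range sliders and other fields to check boxes are assumptions chosen to match the query devices shown in FIG. 6.

```python
from typing import Any

def classify_field(values: list[Any]) -> str:
    """Step 2 (sketch): label a field 'numeric' or 'categorical'."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "categorical"

def field_range(values: list[Any], kind: str) -> Any:
    """Step 3 (sketch): (min, max) for numbers, sorted distinct values otherwise."""
    if kind == "numeric":
        return (min(values), max(values))
    return sorted(set(map(str, values)))

def select_query_device(kind: str, rng: Any) -> dict:
    """Step 5 (sketch): pick an initial query device from type and range."""
    if kind == "numeric":
        return {"device": "range_slider", "low": rng[0], "high": rng[1]}
    return {"device": "check_boxes", "options": rng}

def initial_devices(records: list[dict]) -> dict:
    """Run the typing, range and interface steps for every field."""
    devices = {}
    for name in records[0]:
        values = [r[name] for r in records]
        kind = classify_field(values)
        devices[name] = select_query_device(kind, field_range(values, kind))
    return devices

# Hypothetical records with one parameter field and one factor field.
records = [
    {"P1": 2.5, "factor": "F1"},
    {"P1": 7.0, "factor": "F2"},
    {"P1": 4.0, "factor": "F1"},
]
devices = initial_devices(records)
```

In this sketch the field P1 receives a range slider spanning its observed minimum and maximum, and the factor field receives one check box per distinct value.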

The user may then adjust the interface query devices as 
displayed on the browser 620 (or analogous monitor display) 
and submit the adjustments via the browser 620 using any 
conventional techniques. In FIG. 6, a range slider for a
parameter P1 is shown, as are two check boxes for factors
F1 and F2 and a hypothetical two-curve display of the
results. This is of course just one very simplified example of
a possible display. FIG. 5 is another. The data analysis
system 110 then in turn regenerates the display in accor- 
dance with the adjusted input query devices (which form 
display parameters). 

The data analysis system 110 preferably also includes a
repository for various structure-determining algorithms, such
as linear regression, curve-fitting, neural networks, etc. The
user may then also select which algorithm he prefers for a
given data analysis rather than accept the structure algorithm
that the analysis system would select automatically as
described above. This feature may be needed in certain
applications, such as clinical drug trials, in which the
measure of relevance is prescribed.

Remote Hosting Embodiment

FIG. 7 illustrates the general configuration of an
embodiment of the invention in which a host system 700 is
connected via the network 614 to one or more user systems
702, 704, 706, which are also labeled USER1, USER2,
USERn for more ready reference. Merely by way of
example, in FIG. 7, features that are identical to those
included in previously illustrated embodiments and aspects
of the invention have retained the same reference numbers.

In this embodiment, each user system USER1, . . . ,
USERn, as in earlier embodiments, is assumed to include a
standard display capability, some standard input device or,
in most cases, both. These are not shown in FIG. 7 merely
for the sake of simplicity. As FIG. 7 illustrates, the user
systems may, but need not, include or at least be able to
access one or more local data bases, for example, DBU1 at
the USER1 site, DBU2 at the USER2 site, and so on. Note
that any user site may contain any number of data bases,
including none at all. Single data bases are shown in FIG. 7
for USER1 and USER2 merely for the purpose of illustration.

Each user who wishes to be able to view the results of a
data analysis will also need some conventional display
software; those who also wish to be able to adjust query
devices and thus interactively explore relationships in the
data will also need conventional software that enables
on-screen manipulation of the query devices. A conventional
browser (such as Internet Explorer or Netscape Navigator)
is, as is well known, software suitable for both these
tasks (data presentation and query device adjustment) and
is therefore preferably included in each user system for each
user participating in the collaborative data analysis
according to this embodiment of the invention. In FIG. 7, a
browser is shown as being included in the USER1 system, but
may be assumed to be in other user systems as needed.

In this remote-hosting embodiment of the invention, the
host system includes, as in other embodiments, a processor
610 or system of processors, which is connected (via a
conventional I/O device such as a modem, ISDN interface,
etc., not shown) to the network 614 either through a
firewall, or directly, or both, depending on the configuration
preferred in any given implementation. As is explained
further below, the host system should be able to receive and
store data for analysis, to analyze the data, and to communicate
the results of the analysis for preferably visual presentation
to one or more users. As such, in this embodiment of the
invention, the host system serves as a network portal through
which users may collaboratively explore data that they
themselves have generated through other conventional
means, with no need to individually acquire and run data
analysis software. Improvements to the data analysis routine
can thus be made readily available to all users who
participate in the system, simply by updating the software in
the host system.

In the preferred embodiment, in which the automatic data
analysis and query generation routine described above is
implemented, the host system also needs to be able to sense
changes by a user to displayed query devices, and to then
transmit the correspondingly updated visualization of the
data to the user. The host system will therefore in most cases
be implemented using a conventional network server, which
typically has all necessary hardware and can be programmed
using known software techniques to accomplish
all these tasks. Another advantage of using a server is that it
will typically have enough memory for a large user space,
and can handle more than one network access at a time, thus
allowing several different, unrelated data analyses, that is,
data exploration or "data mining" operations, to take place
at the same time. Even common standard computer systems,
such as "personal computers", will, however, in most cases
have sufficient memory and include a modem or other
network-connection hardware to handle at least one user
who is accessing the data analysis module according to the
invention.

As in other computer systems, the host system 700 
preferably includes system software 720, such as an oper- 
ating system, various device drivers, etc., all of which are 
well-known in the art of computer science. As is also 
well-known, such system software coordinates, in any
known manner, the transfer of information between the host 
system 700 and the network 614, and it also allocates and 
administers the memory within the host used by applications 
such as the data analysis module 110 according to the 
invention, and an e-mail routine 730 (if included) such as
Microsoft Outlook Express. 

In this remote-hosting embodiment of the invention, a
user space 740 is allocated within the memory, either the
system memory, if enough is available, or, preferably,
non-volatile storage such as disk memory. This user memory
space is then preferably partitioned or allocated into regions
in such a way that each user who is participating in the
remotely hosted data analysis system according to this
embodiment of the invention is allocated a memory region.
Each user may then transfer, that is, upload, via the network,
either data to be used in an analysis, or network links (such
as URLs) to such data, or both. By way of example, in FIG.
7, a memory region has been allocated to each of the users
USER1, USER2, . . . , USERn, each of whom is assumed to
be a participant in the system. USERx, however, is assumed
not to be a participating user, and thus has no allocated
memory region in the user space 740.
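The partitioning of the user space into per-user regions can be pictured with the following minimal sketch. The class names `UserRegion` and `UserSpace`, the in-memory dictionaries and the example address are hypothetical stand-ins; an actual host would more likely use non-volatile storage, as noted above.

```python
from dataclasses import dataclass, field

@dataclass
class UserRegion:
    """One participant's region: uploaded data sets plus links to other data."""
    data: dict = field(default_factory=dict)    # data set name -> records
    links: dict = field(default_factory=dict)   # link name -> network address

class UserSpace:
    """Hypothetical stand-in for the user space 740."""

    def __init__(self) -> None:
        self.regions: dict[str, UserRegion] = {}

    def enroll(self, user: str) -> None:
        """Allocate a region for a participating user (no-op if one exists)."""
        self.regions.setdefault(user, UserRegion())

    def upload_data(self, user: str, name: str, records: list) -> None:
        self._region(user).data[name] = records

    def upload_link(self, user: str, name: str, address: str) -> None:
        self._region(user).links[name] = address

    def _region(self, user: str) -> UserRegion:
        # Non-participants (such as USERx) have no region to upload into.
        if user not in self.regions:
            raise PermissionError(f"{user} is not a participating user")
        return self.regions[user]

space = UserSpace()
for u in ("USER1", "USER2"):
    space.enroll(u)
space.upload_data("USER1", "DBU1", [{"x": 1}])
space.upload_link("USER2", "LINK DBP", "http://example.org/dbp")
```

An attempt by a non-enrolled user to upload raises `PermissionError` in this sketch, which mirrors the statement that USERx has no allocated region.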

Methods for accessing a server and for uploading data, for
example, using the File Transfer Protocol, into a dedicated
memory space of the server are known. For example,
web-hosting services use this known uploading procedure to
allow users to store the HTML (for example) code for their
web sites, as well as data (including images, executable
code, etc.) that may be downloaded by those accessing the
respective web site.

In general, in this embodiment of the invention, users
upload into their respective user spaces the data and links to
data that are to be used in the data analysis. The decision
support system 110 then accesses and analyzes the data as
before, preferably including automatically generating initial
query devices, and makes the results available to one or
more of the participating users. These results are then
transferred in a conventional manner over the network for
display on one or more user systems, preferably on their
respective browsers. A user may then alter one or more
queries as described above, for example by adjusting the
respective displayed query devices, and submit the adjusted
queries back over the network to the host system, whereupon
the decision support system 110 regenerates the display
according to the updated queries.
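The adjust-and-regenerate cycle described above can be illustrated with a minimal sketch in which the "display" is simply the list of records that satisfy the current query devices. The dictionary layout of a query device and the function name `apply_queries` are assumptions made for this example only.

```python
def apply_queries(records, queries):
    """Return the records that satisfy every adjusted query device."""
    def keep(rec):
        for name, q in queries.items():
            if q["device"] == "range_slider":
                if not (q["low"] <= rec[name] <= q["high"]):
                    return False
            elif q["device"] == "check_boxes":
                if rec[name] not in q["checked"]:
                    return False
        return True
    return [r for r in records if keep(r)]

# Hypothetical uploaded records.
records = [
    {"P1": 2.5, "factor": "F1"},
    {"P1": 7.0, "factor": "F2"},
    {"P1": 4.0, "factor": "F1"},
]
queries = {
    "P1": {"device": "range_slider", "low": 3.0, "high": 8.0},
    "factor": {"device": "check_boxes", "checked": {"F1", "F2"}},
}
first_view = apply_queries(records, queries)

# The user unchecks F2 and resubmits; the host regenerates the view.
queries["factor"]["checked"] = {"F1"}
second_view = apply_queries(records, queries)
```

Only the query state travels from the user system to the host in this sketch; the data remain on the host and the regenerated view is what would be sent back for display.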

FIG. 7 illustrates just a few of the many possible
combinations of data that can be presented for analysis to the
decision support system 110. In this example, USER1 and
USER2 have uploaded the data in data bases DBU1 and DBU2,
respectively, into their respective user memory regions
USER1 and USER2. Each user has also uploaded a link:
USER1 has uploaded a link (LINK DB2) to the DBU2 data
stored in the USER2 region, and USER2 has uploaded a link
(LINK DBP) to an external data base DBP, which is
assumed to be accessible via the network, for example, from
a network server. The data that USER1 wishes to be included
in the automatic analysis is thus the data in DBU1 and
DBU2, and the data that USER2 wishes to be analyzed is the
data in DBU2 and DBP.

According to this embodiment of the invention, one or 
more users (user systems) may thus be the source of data for 
any given analysis, and the results may be made available to 
any user who is participating in the system, or may be sent 
by any participating user, for example by e-mail attachment, 
even to other non-participants. In order to protect the privacy 
of data uploaded into any user memory region, or otherwise 
stored and referenced in the host system, conventional 
password protection may be included to prevent access by 
unauthorized users. In order to protect the integrity of any 
data file, moreover, conventional methods may also be 
employed to restrict it to read-only use.

In some implementations of the invention, each user will
be allocated a certain amount of memory in which to upload 
data. This would be similar to the manner in which web- 
hosting services assign to each member a certain amount of 
memory for that member's web site. However, it is possible 
when using this embodiment of the invention for the amount 
of data to be analyzed to be very large, possibly exceeding 
any preset size limit. This situation may be handled in 
several ways. One way is simply for the user system to 
request the host system to allocate more memory, at least 
temporarily. In most cases, any transfer of a data base also 
first involves transferring information concerning the size of 
the data base, the number of data sets included, the number 
of records in each data set, and the number of fields in each 
record. If any data set exceeds some predetermined 
threshold, as mentioned above, then the host system could 
instead direct the user system to transfer only a subset of the 
data according to a sampling procedure, such as those 
described above. 
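One possible form of this size check is sketched below. The function name `plan_transfer`, the use of a record count as the size measure, and uniform random sampling are assumptions for illustration; any of the sampling procedures referred to above could be substituted.

```python
import random

def plan_transfer(record_count: int, threshold: int, seed: int = 0) -> list[int]:
    """Return the indices of the records the user system should send.

    If the data set fits under the threshold, every record is sent;
    otherwise a uniform random sample of `threshold` records is chosen.
    """
    if record_count <= threshold:
        return list(range(record_count))
    rng = random.Random(seed)  # seeded so that the plan is reproducible
    return sorted(rng.sample(range(record_count), threshold))

small = plan_transfer(record_count=5, threshold=10)
large = plan_transfer(record_count=1_000, threshold=10)
```

Because the size information is exchanged before the data itself, the host can make this decision without receiving the full data set.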

Still another way is simply to design the host system to 
include enough mass storage to handle all the anticipated 
uploaded data. An estimate of the need may be determined 
using known methods, and memory may be added as the 
number of users or the size of the data sets increases. This 
is, once again, a common problem facing web-hosting
systems: as the need grows, more servers are added, as well
as more memory capacity. 

It is not necessary to store the actual uploaded data sets in 
the memory regions of the respective users. Rather, all data 
may be stored in the conventional manner in the memory of 
the host system; the user memory regions will then contain 
the address information to the regions of memory where the 
data is stored in the host system, or the network address of 
externally located data, such as DBP. Such file allocation 
systems are well known in the art of computer science. Note 
that one simple addressing scheme would be to assign each 
stored data set a network address (such as a URL), once 
again, analogous to the manner in which web sites are 
structured and addressed. Addresses pointing within the host
system could then be translated to standard file location 
information using a simple allocation and conversion table. 
This would enable uniform addressing of all data sets to be 
used in an analysis, not only by the host system, but also by 
other users who might want to access the data for analyses 
of their own. USER1 could thus easily make his data (for 
example, DBU1) available to USER2, for example, simply 
by telling USER2 what the network address — the link — is to 
DBU1. It would then not be necessary to upload the data into 
the host system again, since it would already be stored there. 
Even users, for example, USERx, who are not participating 
members in the data analysis system could then also access 
the data, given the network address and suitable authoriza- 
tion codes. 
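The allocation and conversion table mentioned above can be sketched as a simple lookup. The host name, the URL paths and the file locations below are hypothetical, and authorization checking is omitted for brevity.

```python
from urllib.parse import urlparse

HOST = "host.example.com"   # hypothetical name of the host system

# Hypothetical conversion table: URL path -> file location within the host.
ALLOCATION_TABLE = {
    "/USER1/DBU1": "/srv/userspace/user1/dbu1.dat",
    "/USER2/DBU2": "/srv/userspace/user2/dbu2.dat",
}

def resolve(address: str) -> tuple[str, str]:
    """Map a data set address to ('local', path) or ('remote', address)."""
    parts = urlparse(address)
    if parts.hostname == HOST:
        # Address points within the host: translate via the table.
        return ("local", ALLOCATION_TABLE[parts.path])
    # Externally located data, such as DBP, keeps its network address.
    return ("remote", address)

local = resolve("http://host.example.com/USER1/DBU1")
remote = resolve("http://data.example.org/dbp")
```

With such a table, a link that USER1 gives to USER2 resolves to the already stored copy of DBU1, so that no second upload is needed.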

When any user has uploaded into his user region either all 
the data to be used in an analysis, or the links (addresses) to 
the data, or both, he may use any known method to indicate 
to the host system that data analysis should begin. For 
example, the user could "click" on a suitable icon on a 
display generated by his browser upon network connection 
to the host system; the analysis/decision support system 110 
will then proceed to classify and analyze the data included 
and/or indicated in the respective user memory region. 

In most cases it will be preferable to upload all the data 
to be analyzed into the memory of the host system, since 
data access during the analysis will then be much faster than 
the network transfer speed. This also makes the data readily 
accessible to other participating users. Thus, whenever the 
decision support system 110 of the host system is pointed by 
way of a link (for example, a network address) to a data base 
whose data has not already been retrieved into the host 
system memory (either into one of the user regions or into 
some other temporarily or permanently allocated storage 
region), then the host system preferably accesses the data 
base corresponding to the link and retrieves the data. This 
may be done, for example, when the user indicates that data 
analysis should begin, or in accordance with some other 
preparatory command for the host system to retrieve the data 
sets to be used in the later analysis. 
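A minimal sketch of this retrieve-once behavior follows. The class name `LinkCache` is hypothetical, and `fake_fetch` is a stand-in for an actual network transfer so that the example runs without a network.

```python
from typing import Callable

class LinkCache:
    """Retrieve linked data into host memory the first time it is needed."""

    def __init__(self, fetch: Callable[[str], list]) -> None:
        self._fetch = fetch
        self._store: dict[str, list] = {}
        self.fetch_count = 0

    def get(self, address: str) -> list:
        if address not in self._store:
            # Data not yet in host memory: retrieve it over the network.
            self._store[address] = self._fetch(address)
            self.fetch_count += 1
        return self._store[address]

def fake_fetch(address: str) -> list:
    """Stand-in for a real network transfer."""
    return [{"source": address}]

cache = LinkCache(fake_fetch)
first = cache.get("http://example.org/dbp")
second = cache.get("http://example.org/dbp")   # served from host memory
```

The second request is served from host memory, which is the speed advantage the paragraph above describes.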

Once the analysis/decision support system has completed 
its initial analysis of the data as described above, it will then 
have selected initial query devices, and will have the display 
data corresponding to the initial analysis, given the selected 
relevance measures, etc. This information, represented in 
FIG. 7 as QUERIES1 and RESULT1 for USER1 and QUERIES2
and RESULT2 for USER2, can then be transferred
for display and adjustment to the respective user systems
702, 704. This information could also be transferred for 
display to more than one user, so that these users may 
independently or collaboratively adjust the queries and see 
how the adjustment affects the displayed visualization of the 
relationships between the various data sets included in the 
analysis. Note that once an analysis has been completed, 
there is no need to redo it just because a query has been 
changed; rather, the results will be available for visualization 
even later, and as long as the structure (relevance measures) 
determined by the decision support module is not changed 
by the user, then the user can continue with his analysis by 
viewing the previously stored results (for example,
RESULT1) and adjusting the queries as normal.

According to another feature that may be included in this 
embodiment of the invention, the host system may also 
maintain a log of the actions taken by users with respect to 
any data analysis. In FIG. 7, a log is shown in each user 
memory, that is, LOG1 for USER1, LOG2 for USER2, etc.
The log may contain, for example, the history of which
user(s), which data, and which query states were included in
a particular analysis, visualization or report, possibly with a
copy of those results. In effect, the log could, as its name
implies, be a record of the analysis history for any body of
data. Another advantage of maintaining the logs, besides the
obvious advantage that users can track their work over time,
is that they would also enable audits. For example,
pharmaceutical companies could make their logs available to
regulatory authorities in order to validate the results of
clinical studies.
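By way of illustration, such a log could be kept as an append-only list of entries. The field names and the `history_for` helper below are assumptions for this sketch; the text above does not prescribe a log format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LogEntry:
    """Which users, which data and which query state went into one analysis."""
    users: tuple
    data_sets: tuple
    query_state: dict
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

class AnalysisLog:
    """Append-only record of analyses, such as LOG1 for USER1."""

    def __init__(self) -> None:
        self.entries: list[LogEntry] = []

    def record(self, users, data_sets, query_state) -> LogEntry:
        # Copy the inputs so later changes by the caller cannot alter the log.
        entry = LogEntry(tuple(users), tuple(data_sets), dict(query_state))
        self.entries.append(entry)
        return entry

    def history_for(self, data_set: str) -> list[LogEntry]:
        """All logged analyses that included the given data set."""
        return [e for e in self.entries if data_set in e.data_sets]

log1 = AnalysisLog()
log1.record(["USER1"], ["DBU1", "DBU2"], {"P1": (3.0, 8.0)})
log1.record(["USER1", "USER2"], ["DBU2"], {"P1": (0.0, 5.0)})
audit = log1.history_for("DBU1")
```

An auditor asking which analyses touched a given body of data would, in this sketch, call `history_for` with the name of the data set.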

In addition to, or instead of, the logging feature, another
feature that may be included in this embodiment of the
invention is e-mail notification of any access or change of
uploaded data in any user's memory region. Whenever data
in any user's memory region is accessed or used by any user
besides the one who originally uploaded the data into the
host, then the host system would transmit a message to this
effect as e-mail, via the network, to the original user,
preferably with information identifying the user accessing
the data. Such accesses could then also be logged in the
user's memory region. This would not only increase the data
security of the system, but it would also be useful for
coordinating multiple analyses involving, at least in part, the
same data.
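The notification rule can be sketched as follows. To keep the example self-contained, a plain list named `outbox` stands in for the e-mail module 730; the function name and message fields are likewise assumptions.

```python
def access_data(owner, accessor, data_set, outbox, access_log):
    """Log an access by another user and queue a notice to the owner.

    Accesses by the owner are ignored. Returns True if a notice was queued.
    """
    if accessor == owner:
        return False
    access_log.append({"accessor": accessor, "data_set": data_set})
    outbox.append({
        "to": owner,
        "subject": f"{data_set} was accessed",
        "body": f"{accessor} accessed {data_set} in your memory region.",
    })
    return True

outbox, log1 = [], []
access_data("USER1", "USER1", "DBU1", outbox, log1)   # owner: no message
access_data("USER1", "USER2", "DBU1", outbox, log1)   # other user: message
```

The queued message identifies the accessing user, as the paragraph above prefers, and the same event is appended to the owner's log.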

This remote-hosting embodiment of the invention is
particularly advantageous when more than one researcher wants
to explore data, which may have been generated by any or
all of them. It also allows them to link into even other
externally generated data, as long as it is available in a
known format via the network (for example, demographic or
meteorological data often made available by governmental
agencies), and thus explore possible relationships with data
gathered from outside their own research team. Results (in
particular, display data that visualize an especially
interesting relationship) can also be sent, for example, as an
attachment in any conventional format to e-mail generated
and transferred in the conventional manner by the e-mail
module 730, to anyone who is able to access the network and
who has a browser or similar software that is able to receive
and display the data.

The automatic data classification and analysis method
described above is of course the preferred method carried
out by the analysis/decision support system 110, since its
advantages are just as beneficial in this remotely hosted
embodiment of the invention as in any other. Indeed, it is
particularly advantageous in this embodiment, since users
may submit for analysis data from several different sources,
so that it will often be especially difficult for the user to
know ahead of time what postulated relationships (relevance
measures) are most likely to yield interesting and perhaps
even surprising results.

On the other hand, the main feature of this embodiment of
the invention is that one or more users can upload data for
analysis into the host system, which carries out the actual
analysis; the host system thus acts as a network portal and
needs only to be able to access the uploaded user-specified
data to be included in the analysis. As such, this embodiment
of the invention does not presuppose any particular analysis
routine. The analysis method described above, with
automatic selection of relevance measures and/or even query
devices, is preferred because of its flexibility and because
it so effectively enables users to visualize and even
discover relationships in the data.

Alternatively, the analysis/decision support system 110 could
implement a particular, pre-determined analysis routine, or a
library of possible analysis and visualization methods (for
example, linear regression, polynomial, trigonometric or
similar data-fitting algorithms, neural networks with
predetermined initial structures, etc.). Along with submitting
data for analysis, the user could then also select which of the
analysis routines should be applied, for example, by
selecting it from a browser-displayed pull-down menu. Of
course, as is mentioned above, this feature may be needed in
certain applications, such as clinical drug trials, in which the
measure of relevance is prescribed.
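A library of selectable routines can be sketched as a table keyed by the names that would appear in the pull-down menu. The two routines, the menu names and the fixed default below are assumptions for illustration; in particular, the fixed default merely stands in for the automatic selection described earlier.

```python
from statistics import mean

def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_only(xs, ys):
    """Trivial baseline: a horizontal line through the mean of ys."""
    return 0.0, mean(ys)

# Hypothetical library; the keys play the role of pull-down menu entries.
ROUTINES = {"linear regression": linear_fit, "mean baseline": mean_only}

def run_analysis(xs, ys, selected=None, default="linear regression"):
    """Use the routine the user picked, else the host's default choice."""
    return ROUTINES[selected or default](xs, ys)

xs, ys = [0, 1, 2, 3], [1, 3, 5, 7]
auto = run_analysis(xs, ys)
chosen = run_analysis(xs, ys, selected="mean baseline")
```

Where the measure of relevance is prescribed, as in the clinical-trial example, the user system would always pass an explicit `selected` value.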

The analysis/decision support module could, for example,
also implement conventional text-based (for example,
keyword-driven) querying and reporting, or
multi-dimensional, hierarchical data analysis and visualization.
The user could, for example, then, after or in conjunction
with uploading data into the host system, upload text queries
and view results as in conventional systems, except that,
using this embodiment of the invention, the data being
analyzed will have been uploaded from one or more users
into the host portal.

Because this remote-hosting embodiment of the invention
is not dependent on any particular analysis routine, it is also
not limited to any particular data structure. Using the
preferred analysis routine, with automatic selection of
relevance measures and initial query devices, data in the data
bases will typically be organized into records, with each
record including at least one field. The actual data structure
used to organize the data uploaded into the host system will,
however, depend on which type of analysis routine is to be
invoked for the analysis. If the data is to be used for
conventional text-based searching, then the data structure
can be a simple one-dimensional list. In general, for any
n-dimensional analysis, visualization, text-based report, etc.,
the data should typically be able to be classified into at least
a corresponding number n of different sets that can be
compared using some measure of relevance.

What is claimed is:

1. A method for processing data from at least one data 
base, in which each data base contains a plurality of records 
and each record includes a plurality of data fields, and each 
field contains field data, has a field name and one of a 
plurality of data types, comprising the following steps:

receiving into a host system, via a network, the data from
the at least one data base from at least one participating
remote user system that is separate from the host
system;

in the host system, upon receipt of a request for initiation
from the remote user system, analyzing the data from
the at least one data base according to an analysis
routine and generating analysis results;

in the host system, generating a representation of the
analysis results; and

transferring the representation of the analysis results via
the network for display on at least one participating
remote user system;

in a decision support module in the host system,
automatically selecting an initial, adjustable, graphical
query device as a function of and adapted to a type and
range of the corresponding field data;

transferring each graphical query device via the network
to at least one participating user system;

sensing, via the network, adjustment by the user of each
participating user system to which each graphical query
device has been transferred of any of the displayed,
adjustable, graphical query devices; and

in the host system, updating the representation of the
analysis results corresponding to the sensed
adjustments of any of the query devices, thereby enabling
interactive visualization of the analysis results of the
data via the network.

2. A method as in claim 1, in which at least one of the user 
systems to which graphical query devices are transferred is 
one of the participating user systems other than the partici- 
pating source user system. 

3. A method as in claim 1, further including the step of 
allocating, for each participating user system, a correspond- 
ing memory region in the host system, each memory region 
storing: 

data from the at least one data base transferred via the 
network from the respective participating user system 
to the host system; and 

a log of accesses to the data stored in the respective 
memory regions. 

4. A method as in claim 1, further including the step of 
notifying, via the network, each user whose corresponding 
data, stored in the respective memory region, is accessed by 
any other participating user. 

5. A method for processing and visualizing data from at 
least one data base, in which each data base contains a 
plurality of records and each record includes a plurality of 
data fields that include field data, comprising the following 
steps: 

receiving in a host system, via a network, from at least one 
remote participating user system separate from the host 
system, the data from the at least one data base; 

in the host system, upon receipt of a request for initiation 
from the remote user system, analyzing the data from 
the at least one data base by detecting a relational 
structure between the data fields by calculating a 
respective relevance measure for each of the data fields, 
the relevance measure being a data type-dependent
function indicating a measure of relational closeness 
between data in at least one of the data fields of the 
plurality of records to data in at least one other of the 
data fields of the plurality of records; 

in the host system, generating a graphical representation 
of the relational structure; 

transferring the graphical representation of the relational 
structure via the network for display on at least one 
participating user system; 

for each of the data fields, in a decision support module 
in the host system, automatically selecting an initial, 
adjustable, graphical query device as a function of and 
adapted to the type and range of the corresponding field 
data; 

transferring each graphical query device via the network 
to at least one participating user system; 

sensing, via the network, adjustment by the user of each 
participating user system to which each graphical query 
device has been transferred of any of the displayed, 
adjustable, graphical query devices; and 

in the host system, updating the graphical representations 
of the relational structures corresponding to the sensed 
adjustments of any of the query devices, thereby 
enabling interactive visualization of the relational 
structures of the data fields via the network. 

6. A method as in claim 5, in which at least one of the user
systems to which graphical query devices are transferred is 
one of the remote participating user systems other than the 
initiating, participating source user system. 